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Abstract 



Ttiis paper presents Natural Evolution Strategics (NES), a recent family of algorithms 
that constitute a more principled approach to black-box optimization than established 
evolutionary algorithms. NES maintains a parameterized distribution on the set of solution 
" ^ , candidates, and the natural gradient is used to update the distribution's parameters in 

' the direction of higher expected fitness. We introduce a collection of techniques that 

00 . address issues of convergence, robustness, sample complexity, computational complexity 

\j ' and sensitivity to hyperparameters. This paper explores a number of implementations of 

the NES family, ranging from general-purpose multi-variate normal distributions to heavy- 
' tailed and separable distributions tailored towards global optimization and search in high 

, dimensional spaces, respectively. Experimental results show best published performance 

on various standard benchmarks, as well as competitive performance on others. 
Keywords: Natural Gradient, Stochastic Search, Evolution Strategies, Black-box Opti- 
mization, Sampling 

! 1. Introduction 

Many real world optimization problems are too difficult or complex to model directly. 
Therefore, they might best be solved in a 'black-box' manner, requiring no additional in- 
formation on the objective function (i.e., the 'fitness' or 'cost') to be optimized besides 
fitness evaluations at certain points in parameter space. Problems that fall within this cat- 



egory are nurn e rous, ranging from applica tions in health and scie nce (IWinter et al 



Shir and Backl. l2007l: Uebaha et al.1 . l2007l l to aeronautic design (jHaseniager et al 



Klockgether and Schwefel . 197d ) and control ( Hansen et al. . 20091 ). 



2005; 



20051 : 



Numerous algorithms in this vein have been developed and applied in the past fifty years 
(see section 2 for an overview), in many cases providing good and even near-optimal solutions 
to hard tasks, which otherwise would have required domain experts to hand-craft solutions 
at substantial cost and often with worse results. The near-infeasibility of finding globally 
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optimal solutions requires a fair amount of heuristics in black-box optimization algorithms, 
leading to a proliferation of sometimes highly performant yet unprincipled methods. 

In this paper, we introduce Natural Evolution Strategies (NES), a novel black-box op- 
timization framework which is derived from first principles and simultaneously provides 
state-of-the-art performance. The core idea is to maintain and iteratively update a search 
distribution from which search points are drawn and their fitness evaluated. The search 
distribution is then updated in the direction of higher expected fitness, using ascent along 
the natural gradient. 



1.1 Continuous Black-Box Optimization 



The problem of black-box optimization has spawned a wide variety of approaches. A first 
class of methods was i nspired by classic o ptimi zation methods, including simplex methods 
such as Nelder-Mead ([Nelder and Meadl . Il965l). as well as menibers of the quasi-Newton 
family of algorithms. Simulated annealing ( Kirkpatrick et al. . 19831 ). a popular method 
introduced in 1983, was inspired by thermodynamics, and is in fact an adaptation of the 
Metropolis-Hastings algorithm. More heuristic methods, such as those inspired by evo- 
lution, have been developed frorn the early 1950 s on. These include t he broad class of 
genet ic algorithms (|Hollandl . Il992l : IColdberel . | Il989t). differential evolution (IStorn and Pried 
1997), estimation of distribution a lgorithms (Larranaga, 20021 ). p article swarm optimiza- 
tion ( Kennedv and Eberhartl . l200l1 ). and the cross-entropy method ( Rubinstein and Kroese , 
200i). 



Evolution str a.tegies (ES), introduced hy Ingo Recheiiberg a nd Hans-Paul Schwefel in the 
1960s and 1970s (jRechenberg and Eigen . Il973l : lschwefe] . ll977l l. were designed to cope with 
high-dimensional continuou s-valued domains and hav e remained an active field of research 
for more than four decades (jBever and Schwefell . |2002| ) . ESs involve evaluating the fitness of 
real-valued genotypes in batch ('generation'), after which the best genotypes are kept, while 
the others are discarded. Survivors then procreate (by slightly mutating all of their genes) 
in order to produce the next batch of offspring. This process, after several generations, 
was shown to lead to reasonable to excellent results for many difficult optimization prob- 
lems. The algorithm framework has been developed extensively over the years to include 
self-adaptation of the search parameters, and the representation of correlated mutations 
by the use of a full covariance matrix. This allowed the framework to capture interre- 
lated dependencies by exploiting the covariances while 'mutating' individuals for the next 
generation. The culminating algor i thm, covariance matrix adaptation evolution strategy 



(CMA-ES: iHansen and Ostermeieri. l200lh. h a s proven successful in numerous studies (e.g.. 



Friedrichs and leel l2005l : iMuller et al.l . I2OO2I : IShepherd et"al] . I2OO6I I 



While evolution strategies prove to be an effective framework for black-box optimiza- 
tion, their ad hoc procedures remain heuristic in nature. Thoroughly analyzing the actual 
dynamics of the procedure turns out to be difficult, the considerable effo rts of various re- 
searchers notwithstanding (iBeyed . I2OOII : iJebalia et al.l . I2OIOI : lAugeri . l2005l ). In other words, 
ESs (including CMA-ES), while powerful, still lack a clear derivation from first principles. 
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1.2 The NES Family 

Natural Evolution Strategies (NES) are well-grounded, evolution-strategy inspired black- 
box optimization algorithms, which instead of maintaining a population of search points, 
iteratively update a search distribution. Like CMA-ES, they can be cast into the framework 
of evolution strategies. 

The general procedure is as follows: the parameterized search distribution is used to 
produce a batch of search points, and the fitness function is evaluated at each such point. 
The distribution's parameters (which include strategy parameters) allow the algorithm to 
adaptively capture the (local) structure of the fitness function. For example, in the case 
of a Gaussian distribution, this comprises the mean and the covariance matrix. From 
the samples, NES estimates a search gradient on the parameters towards higher expected 
fitness. NES then performs a gradient ascent step along the natural gradient, a second- 
order method which, unlike the plain gradient, renormalizes the update w.r.t. uncertainty. 
This step is crucial, since it prevents oscillations, premature convergence, and undesired 
effects stemming from a given parameterization. The entire process reiterates until a stopping 
criterion is met. 

All members of the 'NFS family' operate based on the same principles. They differ in the 
type of distribution and the gradient approximation method used. Different search spaces 
require different search distributions; for example, in low dimensionality it can be highly 
beneficial to model the full covariance matrix. In high dimensions, on the other hand, 
a more scalable alternative is to limit the covariance to the diagonal only. In addition, 
highly multi-modal search spaces may benefit from more heavy-tailed distributions (such as 
Cauchy, as opposed to the Gaussian). A last distinction arises between distributions where 
we can analytically compute the natural gradient, and more general distributions where we 
need to estimate it from samples. 



1.3 Paper Outline 



This paper builds upon an d extends our previous wor k o n Natural Evolution Strategies ([Wierstra et al 



2008 : Sun et al. . 2009al lbl: iGlasmachers et ah . 2010al lbl: ISchaul et al. . 2011 ). and introduces 
novel performance- and robustness-enhancing techniques (in sections 13.31 and 13. 4p , as well 
as an extension to rotationally symmetric distributions (section 15. 2p and a plethora of new 
experimental results. 

The paper is s tructured as follows. S ection [2] presents the general idea of search gradients 
as introduced by Wierstra et al. ( 20081 ). outlining how to perform stochastic search using 
parameterized distributions while doing gradient ascent towards higher expected fitness. 
The limitations of the plain gradient are exposed in section [2^21 and subsequently addressed 
by the introduction of the natural gradient (section 12. 3p , resulting in the canonical NES 
algorithm. 

Section [3] then regroups a collection of techniques that enhance NFS's performance and 
robustness. This includes fitness shaping (designed to render th e algorithm invariant w.r.t. 



order-preserving fitness transformations ( Wierstra et al. . 20081 ). section 13. ip . importance 



mixing ( designed to recyc le samples so as to reduce the number of required fitness eval- 



uations ( Sun et al. . 2009bl ). section [3?2]) . adaptation sampling which is a novel technique 
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for adjusting learning rates online (section I3.3p . and finally restart strategies, designed to 
improve success rates on multi-modal problems (section I3.4p . 

In section m we look in more depth at multivariate Gaussian search distributions, consti- 
tuting the most common case. We show how to constrain the covariances to positive-definite 
matrices using the exponential map (section l4.ip . and how to use a change of coordinates to 
reduce the computational complexity from O(d^) to O(d^), with d bein g the search space 



dimension, resulting in the xNES algorithm (jGlasmachers et al.1 . 12010bl . section 14.2 



Next, in section [5l we develop the breadth of the framework, motivating its usefulness 
and deriving a number of NES variants with different search distributions. First, we show 
how a restriction to diagonal paramete rization perrnits th e approach to scale to very high 



dimensions due to its linear complexity (ISchaul et al.l . l201ll : section [5?T]l . Second, we provide 



a novel formulation of NES for the whole class of multi-variate versions of distributions with 
rotation symmetries ( section l5.2p . including heavy-ta iled distributions with infinite variance, 
such as the Cauchy distribution ()Schaul et al.l . l201ll . section 15. Sp . 

The ensuing experimental investigations show the competitiveness of the approach on a 
broad range of benchmarks (section [6]) . The paper ends with a discussion on the effective- 
ness of the different techniques and types of distributions and an outlook towards future 
developments (section [7]) . 



2. Search Gradients 

The core idea of Natural Evolution Strategies is to use search gradients to update the 
parameters of the search distribution. We define the search gradient as the sampled gradient 
of expected fitness. The search distribution can be taken to be a multinormal distribution, 
but could in principle be any distribution for which we can find derivatives of its log-density 
w.r.t. its parameters. For example, useful distributions include Gaussian mixture models 
and the Cauchy distribution with its heavy tail. 

If we use 6 to denote the parameters of distribution 7r(z | 9) and f{x) to denote the fitness 
function for samples z, we can write the expected fitness under the search distribution as 



J{e)=Ee[f{z)] = I /(z) 7r{z\9) dz. 



(1) 



The so-called 'log-likelihood trick' enables us to write 



VeJie) =Ve J f{z) ^{z\6) dz 
= j f{z) Ve7r(z|0) dz 



/(z) V,7r(z I 0) dz 
7r(z I 9) 

j /(z) V0log7r(z|e) TT{z\9)dz 
Ee [/(z) Velog7r(z|e)]. 
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From this form we obtain the Monte Carlo estimate of the search gradient from samples 
zi . . . za as 

1 ^ 

V,J(0)«-^/(zfe) Velog^(zfc|0), (2) 

k=l 

where A is the population size. This gradient on expected fitness provides a search direction 
in the space of search distributions. A straightforward gradient ascent scheme can thus 
iteratively update the search distribution 

where r/ is a learning rate parameter. Algorithm [1] provides the pseudocode for this very 
general approach to black-box optimization by using a search gradient on search distribu- 
tions. 



Algorithm 1: Canonical Search Gradient algorithm 
input: /, 9init 
repeat 

for A; = 1 ... A do 

draw sample z^ ~ 7r(-|6') 
evaluate the fitness /(z^) 
calculate log-derivatives Ve log 7r(zfc|0) 
end 

1 ^ 

Ve J ^ - ^ Ve logTT{zk\9) ■ /(z^) 

k=l 

until stopping criterion is met 



Utilizing the search gradient in this framework is similar to evolution strategies in that 
it iteratively generates the fitnesses of batches of vector- valued samples - the ES's so-called 
candidate solutions. It is different however, in that it represents this 'population' as a 
parameterized distribution, and in the fact that it uses a search gradient to update the 
parameters of this distribution, which is computed using the fitnesses. 

2.1 Search Gradient for Gaussian Distributions 

In the case of the 'default' d-dimensional multi-variate normal distribution, we collect the 
parameters of the Gaussian, the mean ^ G R"' (candidate solution center) and the covariance 
matrix S G R'^^'^ (mutation matrix), in a single concatenated vector 9 = {fi, 51). However, 
to sample efficiently from this distribution we need a square root of the covariance matrix (a 
matrix A G R'^'"^ fulfilling A^A = S)0 Then z = + A''^s transforms a standard normal 
vector s ~ 7V(0, 1) into a sample z ~ 7V(/^, S). Here, I = diag(l, . . . , 1) G R'^^'^ denotes the 

1. For any matrix Q, denotes its transpose. 
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identity matrix. Let 

7r(z I e) 



(V2^)'^det(A) 



• exp 



• exp 



A-i • (z - 
(z-/i)T5]-i(z-/.; 



denote the density of the multinormal search distribution N{^i, S). 

In order to calculate the derivatives of the log-likelihood with respect to individual 
elements of 6 for this multinormal distribution, first note that 



log(27r) - ^ log det S - i (z - (z - 



log vr {z\9) 

We will need its derivatives, that is, V^log7r(z|0) and Vs log vr (z|0). The first is trivially 



V^log7r(z|^) 



1-1 



while the latter is 



Vslog7r(z|^) = -S-i(z-/x) (z-/x: 



2 



(3) 



(4) 



In order to preserve symmetry, to ensure non-negative variances and to keep the mutation 
matrix X) positive definite, S needs to be constrained. One way to accomplish that is by 
representing S as a product 5] = A^A (for a more sophisticated solution to this issue, see 
section HTTI) . Instead of using the log-derivatives on Vslogvr(z) directly, we then compute 
the derivatives with respect to A as 



Va log vr (z) 



Vs log vr (z) + Vs log IT (z) 



Using these derivatives to calculate J, we can then update parameters 6 = {fj,, XI = 
A'''A) as 6 <^ 9 + TjSJgJ using learning rate r]. This produces a new center /x for the 
search distribution, and simultaneously self-adapts its associated covariance matrix S. To 
summarize, we provide the pseudocode for following the search gradient in the case of a 
multinormal search distribution in Algorithm [2l 

2.2 Limitations of Plain Search Gradients 

As the attentive reader will have realized, there exists at least one major issue with applying 
the search gradient as-is in practice: It is impossible to precisely locate a (quadratic) opti- 
mum, even in the one-dimensional case. Let d = 1, = {f^,cr), and samples z ~ M{fi,a). 
Equations ([3]) and ([!]), the gradients on fi and o", become 



z — 

C73 
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Algorithm 2: Search Gradient algorithm: Multinormal distribution 
input: /, ^l^^n,^T.init 
repeat 

for A; = 1 ... A do 

draw sample ~ AA(/x, S) 
evaluate the fitness /(z^) 
calculate log-derivatives: 
V^log7r(zfc|e) = S-i(zfc-/x) _ 
Vs logvr (zfc|0) = -iS-i + is ' (zfc - /x) (zfc - /^)^ 

end 

^ i ELi log ^(zfel^) • /(zfc) 
Vs J ^ i ELi log 7r(zfe|^) • /(z^) 

S ^ S + r?- VsJ 

until stopping criterion is met 



and the updates, assuming simple hill-climbing (i.e. a population size A = 1) read: 

z — 



a ^ a + rj- 



a3 



For any objective function / that requires locating an (approximately) quadratic optimum 
with some degree of precision (e.g. /(z) = z^), a must decrease, which in turn increases the 
variance of the updates, as A/i oc ^ and Aa oc ^ for a typical sample z. In fact, the updates 
become increasingly unstable, the smaller a becomes, an effect which a reduced learning 
rate or an increased population size can only delay but not avoid. Figure [1] illustrates this 
effect. Conversely, whenever o" ^ 1 is large, the magnitude of a typical update is severely 
reduced. 

Clearly, this update is not at all scale-invariant: Starting with cr » 1 makes all updates 
minuscule, whereas starting with a <^ 1 makes the first update huge and therefore unstable. 

We conjecture that this limitation constitutes one of the main reasons why search gra- 
dients have not been developed before: in isolation, the plain search gradient's performance 
is both unstable and unsatisfying, and it is only the application of natural gradients (intro- 
duced in section 12. 3p which tackles these issues and renders search gradients into a viable 
optimization method. 

2.3 Using the Natural Gradient 

Instead of using the plain stochastic gradient for updates, NES follows the natural gradient. 
The natural gradient was first introduced by Amari in 1998 , and has been shown to po ssess 
numerous advantages over the plain gradient ( Amari . 19981 : Amari and Doug k^. ll998l V In 



terms of mitigating the slow convergence of plain gradient ascent in optimization landscapes 
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Figure 1: Left: Schematic illustration of how the search distribution adapts in the one- 
dimensional case: from (1) to (2), /i is adjusted to make the distribution cover 
the optimum. From (2) to (3), cr is reduced to allow for a precise localization 
of the optimum. The step from (3) to (1) then is the problematic case, where 
a small a induces a largely overshooting update, making the search start over 
again. Right: Progression of ^ (black) and cr (red, dashed) when following the 
search gradient towards minimizing /(z) = z^, executing Algorithm [2l Plotted 
are median values over 1000 runs, with a small learning rate rj = 0.01 and A = 10, 
both of which mitigate the instability somewhat, but still show the failure to 
precisely locate the optimum (for which both /i and a need to approach 0). 



with ridges and plateaus, natural gradients are a more principled (and hyper-parameter- 
free) approach than, for example, the commonly used momentum heuristic. 

The plain gradient VJ simply follows the steepest ascent in the space of the actual 
parameters 9 of the distribution. This means that for a given small step-size e, following 
it will yield a new distribution with parameters chosen from the hypersphere of radius e 
and center 9 that maximizes J. In other words, the Euclidean distance in parameter space 
between subsequent distributions is fixed. Clearly, this makes the update dependent on 
the particular parameterization of the distribution, therefore a change of parameterization 
leads to different gradients and different updates. See also Figure [2] for an illustration of 
how this effectively renormalizes updates w.r.t. uncertainty. 

The key idea of the natural gradient is to remove this dependence on the parameter- 
ization by relying on a more 'natural' measure of distance D{9'\\9) between probability 
distributions 7r(z|0) and ■k{z\9'). One such natural dis tance measure between two prob- 
ability distributions is the Kullback-Leibler divergence dKullback and Leibleil . [l95ll v The 



natural gradient can then be formalized as the solution to the constrained optimization 
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ure 2: Illustration of plain versus natural gradient in parameter space. Consider two 
parameters, e.g. 6 = {lJ-,cr), of the search distribution. In the plot on the left, 
the solid (black) arrows indicate the gradient samples log 7r(z | 0), while the 
dotted (blue) arrows correspond to /(z) • Velog7r(z | 6), that is, the same gradi- 
ent estimates, but scaled with fitness. Combining these, the bold (green) arrow 
indicates the (sampled) fitness gradient V^J, while the bold dashed (red) arrow 
indicates the corresponding natural gradient J. 

Being random variables with expectation zero, the distribution of the black arrows 
is governed by their covariance, indicated by the gray ellipse. Notice that this 
covariance is a quantity in parameter space (where the 9 reside), which is not to 
be confused with the covariance of the distribution in the search space (where the 
samples z reside). 

In contrast, solid (black) arrows on the right represent Vg log tt{z\9), and dotted 
(blue) arrows indicate the natural gradient samples /(z) • V6ilog7r(z | 9), resulting 
in the natural gradient (dashed red). 

The covariance of the solid arrows on the right hand side turns out to be the 
inverse of the covariance of the solid arrows on the left. This has the effect that 
when computing the natural gradient, directions with high variance (uncertainty) 
are penalized and thus shrunken, while components with low variance (high cer- 
tainty) are boosted, since these components of the gradient samples deserve more 
trust. This makes the (dashed red) natural gradient a much more trustworthy 
update direction than the (green) plain gradient. 
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problem 

max J {6 + 66)^ J (9) + 69^VgJ, 

SO 

s.t. D{e + 6e\\e) = e, 

where J (9) is the expected fitness of equation ([TJ, and e is a small increment size. Now, 
we have for 59 ^ 0, 

1 



where 



E 



D{9 + 59\\9) = -59 (9) 59, 



TT {z\9) Vg log TT {z\9) Vg log TT {z\9) ' dz, 



Vg log7r(z|6') Ve log7r(z|6') 



is the Fisher information matrix of the given parametric fam ily of search distributions. 
The solution to this can be found using a Lagrangian multiplier (|Petersl . [2007l ). yielding the 
necessary condition 

¥59 = (3VeJ, (5) 

for some constant /3 > 0. The direction of the natural gradient VgJ is given by 59 thus 
defined. If F is invertibl^, the natural gradient amounts to 

VgJ = F-^VgJ{9). 

The Fisher matrix can be estimated from samples, reusing the log-derivatives V0log7r(z|^) 
that we already computed for the gradient VgJ. Then, updating the parameters following 
the natural gradient instead of the steepest gradient leads us to the general formulation of 
NES, as shown in Algorithm [3l 



3. Performance and Robustness Techniques 

In the following we will present and intro duce crucial techniqu es to improves NES's perfor- 
mance and robustness. Fitness shaping (jWierstra et al.l . l2008l ) is designed to make the al- 
gorithm invariant w. r.t. arbitra r y yet o rder-preserving fitness transformations (section l3.ip . 
Importance mixing (jSun et al.l . l2009bl ) is designed to recycle samples so as to reduce the 
number of required fitness evaluations, and is subsequently presented in section [3^21 Adap- 
tation sampling, a novel technique for adjusting learning rates online, is introduced in 
section [3131 and finally restart strategies, designed to improve success rates on multimodal 
problems, is presented in section [331 



3.1 Fitness Shaping 

NES utilizes rank-based fitness shaping in order to render the algorithm invariant under 
monotonically increasing (i.e., rank preserving) transformations of the fitness function. For 



2. Care has to be taken because the Fisher matrix estimate may not be (numerically) invertible even if the 
exact Fisher matrix is. 
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Algorithm 3: Canonical Natural Evolution Strategies 
input: /, Oinit 
repeat 

for A; = 1 ... A do 

draw sample ~ 7r(-|^) 
evaluate the fitness /(z^) 
calculate log-derivatives log 7r(zfc|0) 
end 

J ^ i ELi log ^(zfcl^) • /(zfc) 
1 ^ 

F ^ - J] Ve log ^ (zfc 1 0) Ve log vr (z,, 1 0) ^ 

k=l 

until stopping criterion is met 



this purpose, the fitness of the population is transformed into a set of utility values ui > 
■ ■ • ^ ^A- Let Zj denote the i*'* best individual (the i^^ individual in the population, sorted 
by fitness, such that zi is the best and zx the worst individual). Replacing fitness with 
utility, the gradient estimate of equation ([2|) becomes, with slight abuse of notation, 

A 

VeJ{e) = ^UkVelog7r{zk\e). (6) 



k=l 



To avoid entangling the utility- weightings with the learning rate, we require that X]fc=i l^fel = 
1. 

The choice of utility function can in fact be seen as a free parameter of the algorithm. 
Throughout this paper we will use the following 

max(0,log(^-hl)-log(i)) 1 

Uk = T ^ — : , 

V^^, max (0, log(f + 1) - log(j)) A 



which is directly related to the one employed by CMA-ES (jHansen and Ostermeierl . 120011 ). 
for ease of comparison. In our experience, however, this choice has not been crucial to 
performance, as long as it is monotonous and based on ranks instead of raw fitness (e.g., a 
function which simply increases linearly with rank). 

In addition to robustness, these utility values provide us with an elegant formalism to 
describe the (1+1) hill-climber version of the algorithm within the same framework, by 
using different utility values, depending on success (see section H3] later in this paper). 

3.2 Importance Mixing 

In each batch, we evaluate A new samples generated from search distribution vr {z\9). How- 
ever, since small updates ensure that the KL divergence between consecutive search distri- 
butions is generally small, most new samples will fall in the high density area of the previous 
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Figure 3: Illustration of importance mixing. Left: In the first step, old samples are elimi- 
nated (red triangles) according to ([7|), and the remaining samples (blue triangles) 
are reused. Right: In the second step, new samples (green circles) are generated 
according to 



search distribution vr {z\9'). This leads to redundant fitness evaluations in that same area. 
We improve the efficiency with a new procedure called importance mixing, which aims 
at reusing fitness evaluations from the previous batch, while ensuring the updated batch 
conforms to the new search distribution. 

Importance mixing works in two steps: In the first step, rejection sampling is performed 
on the previous batch, such that sample z is accepted with probability 

.ni4l.(l-a)^|. (7) 



vr (z| 

Here a G [0, 1] is an algorithm hyperparameter called the minimal refresh rate. Let be the 
number of samples accepted in the first step. In the second step, reverse rejection sampling 
is performed as follows: Generate samples from tt {z\0) and accept z with probability 



until A — Aa new samples are accepted. The Aa samples from the old batch and A — Aq newly 
accepted samples together constitute the new batch. Figure [3] illustrates the procedure. 
Note that only the fitnesses of the newly accepted samples need to be evaluated. The 
advantage of using importance mixing is twofold: On the one hand, we reduce the number 
of fitness evaluations required in each batch, on the other hand, if we fix the number of 
newly evaluated fitnesses, then many more fitness evaluations can potentially be used to 
yield more reliable and accurate gradients. 

The minimal refresh rate parameter a has two uses. First, it avoids too low an accep- 
tance probability at the second step when vr {z\6') /vr {z\9) ~ 1. And second, it permits spec- 
ifying a lower bound on the expected proportion of newly evaluated samples p = K [^-^^] , 
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namely p > a, with the equahty holding if and only if = 9' . In particular, if a = 1, all 
samples from the previous batch will be discarded, and if a = 0, p depends only on the 
distance between ir {z\9) and 7r{z\9'). Normally we set a to be a small positive number, 
e.g., in this paper we use a = 0.1 throughout. 

It can be proven that the updated batch conforms to the search distribution vr {z\9). In 
the region where 

the probability that a sample from previous batches appears in the new batch is 

n{z\9')-{l-a)^^^={l-a)7r{z\9). 

The probability that a sample generated from the second step entering the batch is avr {z\9), 
since 



max <a,l } = a. 

So the probability of a sample entering the batch is just p{z\9) in that region. The same 
result holds also for the region where 

When using importance mixing in the context of NES, this reduces the sensitivity to 
the hyperparameters, particularly the population size A, as importance mixing implicitly 
adapts it to the situation by reusing some or many previously evaluated sample points. 

3.3 Adaptation Sampling 

To reduce the burden on determining appropriate hyper-param eters such as the learning 



rate, w e develop a new self-adaptation or meta-learning technique (jSchaul and Schmidhuber 



called adaptaUon sampUng, that can automatically ad^pt th^ settings in a principled 



and economical way. 

We model this situation as follows: Let irg be a distribution with hyper-parameter 9 and 
tp{z) a quality measure for each sample z ~ vrg. Our goal is to adapt 9 such as to maximize 
the quality A straightforward method to achieve this, henceforth dubbed adaptation 
sampling, is to evaluate the quality of the samples z' drawn from irgi, where 9' ^ 9 is a 
slight variation of 9, and then perform hill-climbing: Continue with the new 9' if the quality 
of its samples is significantly better (according, e.g., to a Mann- Whitney U-test), and revert 
to 9 otherwise. Note that this proceeding is similar to the NES algorithm itself, but applied 
at a meta- level to algorithm parameters instead of the search distribution. The goal of this 
adaptation is to maximize the pace of progress over time, which is slightly different from 
maximizing the fitness function itself. 

Virtual adaptation sampling is a lightweight alternative to adaptation sampling that is 
particularly useful whenever evaluating is expensive : 
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do importance sampling on the existing samples Zj, according to ttq/: 



w' 



TT Z 



7r(z|e) 

(this is always well-defined, because z ~ vre =^ ■k[z\6) > 0). 

• compare {'ip{zi)} with weights {wi = l,Vi} and = i{j{zi),\/i} with weights {w[}, 
using a weighted version of the Mann- Whitney test, as introduced in Appendix A. 

Beyond determining whether 6 or 6' is better, choosing a non-trivial confidence level p 
allows us to avoid parameter drift, as 9 is only updated if the improvement is significant 
enough. 

There is one caveat, however: the rate of parameter change needs to be adjusted such 
that the two resulting distributions are not too similar (otherwise the difference won't be 
statistically significant), but also not too different, (otherwise the weights w' will be too 
small and again the test will be inconclusive). 

If, however, we explicitly desire large adaptation steps on 0, we have the possibility of 
interpolating between adaptation sampling and virtual adaptation sampling by drawing a 
few new samples from the distribution -Kg' (each assigned weight 1), where it is overlapping 
least with ng. The best way of achieving this is importance mixing, as introduced in 
Section [3121 uses jointly with the reweighted existing samples. 

For NES algorithms, the most important parameter to be adapted by adaptation sam- 
pling is the learning rate r/, starting with a conservative guess. This is because half-way 
into the search, after a local attractor has been singled out, it may well pay off to increase 
the learning rate in order to more quickly converge to it. 

In order to produce variations rj' which can be judged using t he above-mentioned li- 



test, we propose a procedu re similar in spirit to Rprop-updates (iRiedmiller and Braun . 



19931 : lleel and Huskenl . I2OO3I I , where the learning rates are either increased or decreased by 



a multiplicative constant whenever there is evidence that such a change will lead to better 
samples. 

More concretely, when using adaptation sampling for NES we test for an improvement 
with the hypothetical distribution 6' generated with r]' = 1.5?]. Each time the statistical 
test is successful with a confidence of at least P = ^ — ^(^d+i) (this value was determined 
empirically) we increase the learning rate by a factor of = 1.1, up to at most r/ = 1. 
Otherwise we bring it closer to its initial value: r] ^ 0.9r] + 0.1?7mit- 

Figure H] illustrates the effect of the virtual adaptation sampling strategy on two different 
10-dimensional unimodal benchmark functions, the Sphere function fi and the Rosenbrock 
function /§ (see section [62] for details). We find that, indeed, adaptation sampling boosts 
the learning rates to the appropriate high values when quick progress can be made (in the 
presence of an approximately quadratic optimum), but keeps them at carefully low values 
otherwise. 

3.4 Restart Strategies 

A simple but widespread method for mitigating the risk of finding only local optima in a 
strongly multi-modal scenario is to restart the optimization algorithm a number of times 
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20000 



20000 



Figure 4: Illustration of the effect of adaptation sampling. We show the increase in fitness 
during a NES run (above) and the corresponding learning rates (below) on two 
setups: 10-dimensional sphere function (left), and 10-dimensional Rosenbrock 
function (right). Plotted are three variants of xNES (algorithm fixed default 
learning rate of = 0.1 (dashed, red) fixed large learning rate of r] = 0.5 (dotted, 
yellow), and an adaptive learning rate starting at i] = 0.1 (green). We see that 
for the (simple) Sphere function, it is advantageous to use a large learning rate, 
and adaptation sampling automatically finds that one. However, using the overly 
greedy updates of a large learning rate fails on harder problems (right). Here 
adaptation sampling really shines: it boosts the learning rate in the initial phase 
(entering the Rosenbrock valley), then quickly reduces it while the search needs 
to carefully navigate the bottom of the valley, and boosts it again at the end 
when it has located the optimum and merely needs to zoom in precisely. 



with different initializations, or just with a different random seed. This is even more useful 
in the presence of parallel processing resources, in which case multiple runs are executed 
simultaneously. 

In practice, where the parallel capacity is limited, we need to decide when to stop or 
suspend an ongoing optimization run and start or resume another one. In this section we 
provide one such restart s trategy that takes those decision s. Inspired by recent work on 
practical universal search ( Schaul and Schmidhuber , 2010bl ). this results in an effective use 
of resources independently of the problem. 

The strategy consists in reserving a fixed fraction p of the total time for the first run, 
and then subdividing the remaining time 1 — p in the same way, recursively (i.e. p{l — p)*~^ 
for the i''^ run). The recursive decomposition of the time budget stops when the assigned 
time-slice becomes smaller than the overhead of swapping out different runs. In this way, 
the number of runs with different seeds remains finite, but increases over time, as needed. 
Figure [5] illustrates the effect of the restart strategy, for different values of p, on the example 
of a multi- modal benchmark function /ig (see section 16.21 for details) , where most runs get 
caught in local optima. Whenever used in the rest of the paper, the fraction is p = -g. 
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xNES on BBOB /j. 
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Figure 5: Illustrating the effect of different restart strategies. Plotted is the cumulative 
empirical success probability, as a function of the total number of evaluations 
used. Using no restarts, corresponding to p = 1 (green) is initially faster but 
unreliable, whereas p = jq (dotted black) reliably finds the optimum within 
300000 evaluations, but at the cost of slowing down success for all runs by a factor 
10. In-between these extremes, P = ^ (broken line, red) trades off slowdown and 
reliability. 



Let s{t) be the success probability of the underlying search algorithm at time t. Here, 
time is measured by the number of generations or fitness evaluations. Accordingly, let Sp{t) 
be the boosted success probability of the restart scheme with parameter p. Approximately, 
that is, assuming continuous time, the probabilities are connected by the formula 

oo 

Sp{t) = i-l[[i-s{p{i-py-H) . 

1=1 

Two qualitative properties of the restart strategy can be deduced from this formula, even 
if in discrete time the sum is finite (i < log2(t)), and the times p{l — py~^t need to be 
rounded: 

• If there exists to such that s(to) > then lim Sp{t) = 1 for all p G (0, !)■ 

• For sufficiently small t, we have Sp{t) < Sp{t). 

This captures the expected effect that the restart strategy results in an initial slowdown, 
but eventually solves the problem reliably. 

4. Techniques for Multinormal Distributions 

In this section we will describe two crucial techniques to enhance performance of the NFS 
algorithm as applied to multinormal distributions. First, the method of exponential param- 
eterization is introduced, guaranteeing that the covariance matrix stays positive-definite. 
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Second, a novel method for changing the coordinate system into a "natural" one is laid out, 
making the algorithm computationally efficient. 



4.1 Using Exponential Parameterization 

Gradient steps on the covariance matrix XI result in a number of technical problems. When 
updating I] directly with the gradient step VsJ, we have to ensure that 5] + r/Vs«/ remains 
a valid, positive definite covariance matrix. This is not guaranteed a priori, because the 
(natural) gradient Vs J may be any symmetric matrix. If we instead update a factor A 
of S, it is at least ensured that A^A is symmetric and positive semi-definite. But when 
shrinking an eigenvalue of A it may happen that the gradient step swaps the sign of the 
eigenvalue, resulting in undesired oscillations. 

An elegant way to fix these problems is to re present the covariance matr ix using the 
exponential map for symmetric matrices (see e.g., iGlasmachers and Igell . 120051 for a related 
approach). Let 



5rf := M G 



ndxd 



M' 



M 



and 



v'^Mv > for all v G 



\{0}} 



denote the vector space of symmetric and the (cone) manifold of symmetric positive definite 
matrices, respectively. Then the exponential map 



exp -.Sd-^Vd, 



n=0 



(9) 



is a diffeomorphism: The map is bijective, and both exp as well as its inverse map log : Vd — ?■ 
Sd are smooth. The mapping can be computed in cubic time, for example by decomposing 
the matrix M = UDU^ into orthogonal U (eigen- vectors) and diagonal D (eigen- values) , 
taking the exponential of D (which amounts to taking the element-wise exponentials of the 
diagonal entries), and composing everything baclJl as exp(M) = Uexp(D)U^. 

Thus, we can represent the covariance matrix S G as exp(M) with M G Sd- The 
resulting gradient update for M has two important properties: First, because Sd is a vector 
space, any update automatically corresponds to a valid covariance matrix|l| Second, the 
update of M makes the gradient invariant w.r.t. linear transformations of the search space 
M"^. This follows from an information geometric perspective, viewing Vd as the Riemannian 
parameter manifold equipped with the Fisher information metric . The invariance property 
is a direct consequence of the Cartan-Hadamard theorem (Cartan, 19281 ). See also Figure [6] 
for an illustration. 

However, the exponential parameterization considerably complicates the computation 
of the Fisher information matrix F, which now involves partial derivatives of the ma- 
trix exponential ([9]). This can be done in cubic time per partial derivative according 



3. The same computation works for the logarithm, and thus also for powers Vd ^ Vd, ^ M'^ — 
exp(c • log(M)) for all c £ R, for example for the (unique) square root (c = 1/2). 

4. The tangent bundle TVd of the manifold Vd is isomorphic to VdXSd and globally trivial. Thus, arbitrarily 
large gradient steps are meaningful in this representation. 
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Figure 6: Left: updating a covariance matrix X) directly can end up outside the manifold 
of symmetric positive-definite matrices Vd- Right: first performing the update 
on M = i log(S) in 5^ and then mapping back the result into the original space 
Vd using the exponential map is both safe (guaranteed to stay in the manifold) 
and straight (the update follows a geodesic). 



to ( Naifeld and Havell . 11994 ). resulting in an unacceptable complexity of 0{d7) for the 



computation of the Fisher matrix. 
4.2 Using Natural Coordinates 

In this section we describe a technique that allows us to avoid the computation of the Fisher 
information matrix altogether, for some specific but common classes of distributions. The 
idea is to iteratively change the coordinate system in such a way that it becomes possible 
to follow the natural gradient without any costly inverses of the Fisher matrix (actually, 
without even constructing it explicitly). We introduce the technique for the simplest case 
of multinormal search distributions, and in section 15.21 we generalize it to the whole class 
of distributions that they are applicable to (namely, rotationally-symmetric distributions). 

Instead of using the 'global' coordinates S = exp(M) for the covariance matrix, we 
linearly transform the coordinate system in each iteration to a coordinate system in which 
the current search distribution is the standard normal distribution with zero mean and 
unit covariance. Let the current search distribution be given by (/Lt, A) G M'^ x "P^ with 
A'''A = S. We use the tangent space T(^fj,^A)0^'^ 'x 'Pd) of the parameter manifold M'^ x Vd-, 
which is isomorphic to the vector space W^xSd, to represent the updated search distribution 
as 

(d, M) ^(/^,,„, Anew) = (m + AT<5, Aexp Qm^) • (10) 

This coordinate system is natural in the sense that the Fisher matrix w.r.t. an orthonor- 
mal basis of (5, M) is the identity matrix. The current search distribution A/'(/x, A^ A) is 
encoded as 



7r(z|(5, m)=^^(^^l + a^5, a^ exp(M) a) 



18 



Natural Evolution Strategies 



where at each step we change the coordinates such that (5,M) = (0,0). In this case, it is 
guaranteed that for the variables {S, M) the plain gradient and the natural gradient coincide 
(F = I). Consequently the computation of the natural gradient costs 0{d^) operations. 

In the new coordinate system we produce standard normal samples s ~ M{0, 1) which 
are then mapped back into the original coordinates z = fx + A^-s. The log-density becomes 



log 7r(z I (5, M) = - ^ log(2^) - log ( det(A)) - ^ tr(M) 



exp (-^m) A-T-(z-(/i + A^d)) 



and thus the log-derivatives (at 5 = 0, M = 0) take the following, surprisingly simple forms: 



V,|,^„log^(z|M = 0,5) = V,|,^„ 



2 ■ A-^A^ ■ A-^ ■{z-{n + A^5)) 



A-T(z-/x) 



VmIm=o log^(z|(5 = 0,M) = --V^U^o 



tr(M) + ex.p[ -^M] A'~^{z- fi] 



(11) 



I + 2 • ( [A-"^(z - m)] • exp (-^M ) • [A-^(z - /x) 



[l-[A-^(z-/x)].I.[A-^(z-/x)]" 



1 



(ss^ - 1) 



(12) 



These results give us the updates in the natural coordinate system 



V5J =^/(zfc) • Sfc 
k=l 
A 

VMJ=J]/(zfe)-(SfcsJ-I) 



(13) 
(14) 



k=l 
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which are then mapped back onto (/x, A) using equation (jlOp : 



A • exp ( -M ) = A • exp (r]-'VMJ 



A - exp ^1 • ^^VMlog7r(zfc|6') • /(zfe)j 



4.3 Orthogonal Decomposition of Multinomial Parameter Space 

We decompose the parameter vector space (5,M) G M'^ x 5^ into the product 

M"' X 5d =^x 5j X 5j- , (15) 
(^) ^ (B) 

of orthogonal subspaces. The one-dimensional space 5^ = {A • 1 1 A £ M} is spanned by 
the identity matrix I, and 5^ = {M E 5^ | tr(M) = 0} denotes its orthogonal complement 
in Sd- The different components have roles with clear interpretations: The ((5)-component 

J describes the update of the center of the search distribution, the (cr)-component with 
value Vo- J • I for V^-J = tr(VM</)/c? has the role of a step size update, which becomes clear 
from the identity det(exp(M)) = exp(tr(M)), and the (B)-component VbJ = Vm.J — ^ uJ 
describes the update of the transformation matrix, normalized to unit determinant, which 
can thus be attributed to the shape of the search distribution. This decomposition is 
canonical in being the finest decomposition such that updates of its components result in 
invariance of the search algorithm under linear transformations of the search space. 

On these subspaces we introduce independent learning rates r^o-, and ryBi respectively. 
For simplicity we also split the transformation matrix A = a • B into the step size a G 
and the normalized transformation matrix B with det(B) = 1. Then the resulting update 
is 

A 

A^new = tJ' + VS- V5 J = At + r?5 • XI " (1^) 

fc=l 

(r]a \ fr]a tr(VMJ)\ 

O-new = O- • exp -V^j = CJ • CXp ( ^ ^ I (17) 

B_ = B . exp (f . VbJ) = B . exp (f . (vmJ - • l)) , (18) 



20 



Natural Evolution Strategies 



with VmJ from equation (fT^ . In case of t]^ = r/B, in this case referred to as 77a, the 
updates (fT7|) and ([T8|) simpUfy to 

Anew = A • exp • Vm) (19) 

= A.exp |^!|i.J]^/(z,).(s,sI-I)^. 

The resulting algorithm is cahed exponential NES (xNES), and shown in Algorithm [H We 
also give the pseudocode for its hill-climber variant (see also section 14. Sp . 

Updating the search distribution in the natural coordinate system is an alternative to the 
exponential parameterization (section 14. ip for making the algorithm invariant under linear 
transformations of the search space, which is then achieved in a direct and constructive 
way. 



^|det(A)| 



Algorithm 4: Exponential Natural Evolution Strategies (xNES), for multinormal 

distributions 

input: /, /ij^ji, 'Einit = A^A 

initialize „ . , 
B ^ A/a 

repeat 

for A; = 1 ... A do 

draw sample ~ AA(0, 1) 
Zfc ^ /X + crB'^Sfc 
evaluate the fitness /(z^) 
end 

sort {(sfc,Zfc)} with respect to /(z^) and compute utilities 

comnute gradients ^^'^ ^ ^^=i ""^ ' ^^'^ ^ ^^=i ""^ ' ^^^^^ " 

compute gradients ^ ^^^^^^^^^ 

/I ^ /X + % • crB • VsJ 

update parameters a a • exp(r/o-/2 • V^J) 
B^B-exp(r?B/2-VBJ) 
until stopping criterion is met 



4.4 Connection to CMA-ES. 

It has been noticed in dependently by ( Glasmachers et al. . 2010al ) and shortly afterwards by 
(|Akimoto et al.l . I2OI0I I that the natural gradient updates o f xNES and the strategy updates 
of the CMA-ES algorithm ( Hansen and Ostermeier . 200 ll ) are closely connected. However, 
since xNES does not feature evolution paths, this connection is restricted to the so-called 
rank- /i- update (in the terminology of this study, rank- A- update) of CMA-ES. 
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Algorithm 5: (l+l)-xNES 
input: /, Hi^n, Yiinu = A^A 

fmax ^ OO 

repeat 

draw sample s ~ A/'(0, 1) 

evaluate the fitness /(z = A'''s + /x) 

calculate log-derivatives Vm logvr {z\9) = ^ (ss""" — l) 

(Z1,Z2) ^ {ti,z) 

if fmax < /(z) then 

fmax ^ /(z) 

u^(-4,l) 
update mean ^ z 
else 

I u^(|,0) 
end 

Vm J ^ ^ ELi Vm log^ (z^I^) • = -i^I + m (ss^ _ i) 

A ^ A • exp {^tjVmJ) 
until stopping criterion is met 



First of all observe that xNES and CMA-ES share the same invariance properties. But 
more interestingly, although derived from different heuristics, their updates turn out to be 
nearly equivalent. A closer investigation of this equivalence promises synergies and new 
perspectives on the working principles of both algorithms. In particular, this insight shows 
that CMA-ES can be explained as a natural gradient algorithm, which may allow for a more 
thorough analysis of its updates, and xNES can profit from CMA-ES's mature settings of 
algorithms parameters, such as search space dimension-dependent population sizes, learning 
rates and utility values. 



Both xNES and CMA-ES parameterize the search distribution with three functionally 
different parameters for mean, scale, and shape of the distribution. xNES uses the param- 
eters /i, (T, and B, while the covariance matrix is represented as o"^ • C in CMA-ES, where 
C can by any positive definite symmetric matrix. Thus, the representation of the scale of 
the search distribution is shared among a and C in CMA-ES, and the role of the additional 
parameter a is to allow for an adaptation of the step size on a faster time scale than the 
full covariance update. In contrast, the NES updates of scale and shape parameters a and 
B are properly decoupled. 



Let us start with the updates of the center parameter /x. The up date (flGl) is very similar 
to th e update of the center of the search distribution in CMA-ES, see (jHansen and Ostermeieii . 



200l|). The utility function exactly takes the role of the weights in CMA-ES, which assumes 



a fixed learning rate of one. 
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For the covariance matrix, the situation is more comphcated. From equation ()19p we 
deduce the update rule 

^new — (A-new) ' -Anc^ 

A \ 
" ' ^ ' -A 



=A''' • exp ^r?s • ^ Uk (sksj - I^^ 



for the covariance matrix, with learning rate rjs = 77a- This term is closely connected 
to the exponential parameterization of the natural coordinates in xNES, while CMA-ES 
is formulated in global linear coordinates. The connection of these updates can be shown 
either by applying the xNES update directly to the natural coordinates without the expo- 
nential para meterization, or by ap proximating the exponential map by its first order Taylor 
expansion. (jAkimoto et all , boioi ) established the same connecti on directly in coordinates 



based on the Cholesky decomposition of see dSun et al.l . r2nn9al lbh. The arguably simplest 



derivation of the equivalence relies on the invariance of the natural gradient under coor- 
dinate transformations, which allows us to perform the computation, w.l.o.g., in natural 
coordinates. We use the first order Taylor approximation of exp to obtain 

exp iii^-^Uk (sksj - J I + r/E • ^ Ufc (^s^sj - I 

\ k=l / k=l 

SO the first order approximate update yields 

A 

= (1 - [/ • ?7s) • A' A + rj^.Y^Uk (A"^Sfe) (a^s,. 

k=l 

A 

= (!-[/• ??s) • S + ??s • ^ Uk (zfc - /^) (zfc - fJ-V 

k=l 

wit h U = y^,f„_iUk, from which th e connection to the CMA-ES rank-/i-update is obvious 



(see Hansen and Ostermeier . 200 ll ). 



Finally, the updates of the global step size parameter a turn out to be identical in xNES 
and CMA-ES without evolution paths. 

Having highlighted the similarities, let us have a closer look at the differences between 
xNES and CMA-ES, which are mostly two aspects. CMA-ES uses the well-established 
technique of evolution paths to smoothen out random effects over multiple generations. 
This technique is particularly valuable when working with minimal population sizes, which 
is the default for both algorithms. Thus, evoluti on paths are expected to impro ve stability; 



further interpretations have been provided by (iHansen and Ostermeied . l200ll ). However, 
the presence of evolution paths has the conceptual disadvantage that the state of the CMA- 
ES algorithms is not completely described by its search distribution. The other difference 
between xNES and CMA-ES is the exponential parameterization of the updates in xNES, 
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which results in a multiphcative update equation for the covariance matrix, in contrast to 
the additive update of CMA-ES. We argue that just like for the global step size a, the 
multiplicative update of the covariance matrix is natural. 

A valuable perspective offered by the natural gradient updates in xNES is the deriva- 
tion of the updates of the center /x, the step size a, and the normalized transformation 
matrix B, all from the same principle of natural gradient ascent. In contrast, the updates 
applied in CMA-ES result from different heuristics for each parameter. Hence, it is even 
more surprising that the two algorithms are closely connected. This connection provides a 
post-hoc justification of the various heuristics employed by CMA-ES, and it highlights the 
consistency of the intuition behind these heuristics. 



4.5 Elitism 

The principle of the NES algorithm is to follow the natural gradient of expected fitness. This 
requires sampling the fitness gradient. Naturally, this amounts to what, within the realm 
of evolution strategies, is generally referred to as "comma-selection", that is, updating the 
search distribution based solely on the current batch of "offspring" samples, disregarding 
older samples such as the "parent" population. This seems to exclude approaches that retain 
some of the best samples, like elitism, hill-climbing, or even steady-state selection ( Goldberg . 



In this section we show that elitism (and thus a wide variety of selection schemes) 



is indeed compatible with NES. We exemplify t his technique by der i ving a, NES algorithm 
with (1-1-1) selection, i.e., a hill-climber (see also Glasmachers et al. . 2010al ). 



It is impossible to estimate any information about the fitness gradient from a single 
sample, since at least two samples are required to estimate even a finite difference. The 
(1-|-1) selection scheme indicates that this dilemma can be resolved by considering two 
distinguished samples, namely the elitist or parent zP'^'''^'^'' = fx, and the offspring z. Con- 
sidering these two samples in the update is in principle sufficient for estimating the fitness 
gradient w.r.t. the parameters 9. 

Care needs to be taken for setting the algorithm parameters, such as learning rates 
and utility values. The extremely small population size of one indicates that learning rates 
should generally be small in or der to ensure stability of the algorithm. Another guideline 
is the well known observation ( Rechenberg and Eigen . 19731 ) that a step size resulting in 



a success rate of roughly 1/5 maximizes progress. This indicates that a self-adaptation 
strategy should increase the learning rate in case of too many successes, and decrease it 
when observing too few successes. 

Let us consider the basic case of radial Gaussian search distributions 

tt[z I /X, fj) = exp 



V2^a V 2cj2 

with parameters /x E M*^ and cr > 0. We encode these parameters as ^ = (/x,£) with 
i = log(cj). Let s ~ A/'(0, 1) be a standard normally distributed vector, then we obtain the 
offspring asz = /i-|-(j-s~ AA(/i, a), and the natural gradient components are 

J = u[''^ • + 4''^ • • s 

v,j = nS'^.(-i)+4^).(iisf -1). 
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The corresponding strategy parameter updates read 



a <^ a ■ exp ( rji ■ 



The indices 1 and 2 of the utihty values refer to the 'samples' n and z, namely parent and 
offspring. Note that these samples correspond to the vectors and s in the above update 
equations (see also section . The superscripts of the utility values indicate the different 
parameters. Now elitist selection and the 1/5 rule dictate the settings of these utility values 
as follows: 

• The elitist rule requires that the mean remains unchanged in case of no success {u\^^ = 
1 and u'^^ = 0) , and that the new sample replaces the mean in case of success {u['^^ = 
and tig'^^ = 1, with a learning rate of = 1). 

• Setting the utilities for £ to u^'^ = 1 and Ug^'' = in case of no success, effectively 

reduces the learning rate. Setting u\ = —5 and = in case of success has the 
opposite effect and roughly implements the 1/5-rule. The self-adaptation process can 
be stabilized with a small learning rate 

Note that we change the utility values based on success or failure of the offspring. This 
seems natural, since the utility of information encoded in the sample z depends on its 
success. Highlighting elitism in the selection, we call these utility values success-based. 
This is similar but not equivalent to rank-based utilities for the joint population {/i, z}. 

The NES hill-climber for radial Gaussian search distributions is illustrated in algo- 
rithm [6l This formulation offers a more standard perspective on the hill-climber by using 
explicit case distinctions for success and failure of the offspring instead of success-based 
utilities. The same procedure can be generalized to more flexible search distributions. A 
conservative strategy is to update further shape-related parameters only in case of success, 
which can be expressed by means of success-based utility values in the very same way. The 
corresponding algorithm for multi-variate Gaussian search distributions is Algorithm [5] (in 
section 132]) and for multi-variate Cauchy it is Algorithm [8] (in section [5^ . 

5. Beyond Multinormal Distributions 

In the previous section we have seen how natural gradient ascent is applied to multi-variate 
normal distributions, which arguably constitute the most important class of search distri- 
butions in modern evolution strategies. In this section we expand the NES framework in 
breadth, motivating the usefulness and deriving a number of NES variants with different 
search distributions. 

5.1 Separable NES 

Adapting the full covariance matrix of the search distribution can be disadvantageous, 
particularly in high-dimensional search spaces, for two reasons. 
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Algorithm 6: (1+1)-NES with radial Gaussian distribution 
input: /, /Xj^jj, ainit 

/best ^ fifJ-init) 

repeat 

draw sample s ~ ^f{0, 1) 
create offspring z /x + o" • s 
evaluate the fitness /(z) 
if /(z) > fbest then 
update mean ^ z 

a a ■ exp(5?7o-) 

/best ^ /(z) 
else 

I cj ^ (J • exp{-ri^) 
end 

until stopping criterion is met; 



For many problems it can be safely assumed that the computational costs are governed 
by the number of fitness evaluations. This is particularly true if such evaluations rely on 
expensive simulations. However, for applications where fitness evaluations scale gracefully 
with the search space dimension, the 0{d^) xNES update (due to the matrix exponential) 
can dominate the computation. One such application is the evolutionary training of recur- 
rent neural networks (i.e., neuroevolution) , where the number of weights d in the network 
can grow quadratically with the number of neurons n, resulting in a complexity of 0{n^) 
for a single NES update. 

A second reason not to adapt the full covariance matrix in high dimensional search 
spaces is sample efficiency. The covariance matrix has d{d + l)/2 G 0{d?') degrees of 
freedom, which can be huge in large dimensions. Obtaining a stable estimate of this matrix 
based on samples may thus require many (costly) fitness evaluations, in turn requiring very 
small learning rates. As a result, the algorithm may simply not have enough time to adapt 
its search distribution to the problem with a given budget of fitness evaluations. In this 
case, it may be advantageous to restrict the class of search distributions in order to adapt 
at all, even if this results in a less steep learning curve in the (then practically irrelevant) 
limit of infinitely many fitness evaluations. 

The only two distinguished parameter subsets of a multi-variate distribution that do 
not impose the choice of a particular coordinate system onto our search space are the 
'size' a of the distribution, corresponding to the (2d)-th root of the determinant of the 
covariance matrix, and its orthogonal complement, the covariance matrix normalized to 
constant determinant B (see section 14. 3p . The first of these candidates results in a standard 
evolution strategy without covariance adaptation at all, which may indeed be a viable option 
in some applications, but is often too inflexible. The set of normalized covariance matrices, 
on the other hand, is not interesting because it is clear that the size of the distribution 
needs to be adjusted in order to ensure convergence to an optimum. 

Thus, it has been proposed to give up some invariance properties of the search algorithm, 
and to adapt the class of search distribution with diagonal covariance matrices in some 
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predetermined coordinate system ( Ros and Hansen . 20081 ). Such a choice is justified in 
many apphcations where a certain degree of independ ence can be assu r ned a mong the 
fitness function parameters. It has even been shown in ( Ros and Hansen . 20081 ) that this 
approach can work surprisingly well even for highly non-separable fitness functions. 

Restricting the class of search distributions to Gaussians with diagonal covariance matrix 
corresponds to restricting a general class of multi-variate search distributions to separable 
distributions 



d 

p{z\e) = llp{zi\9,) , 

where p is a family of densities on the reals, and 6 = {6i, . . . , 9^) collects the parameters of 
all of these distributions. In most cases these parameters amount to 9i = {fj,i,ai), where 
/ij G M is a position and <Tj £ is a scale parameter (i.e., mean and standard deviation, 
if they exist), such that Zj = /ij + <Tj • Sj ~ p{- \ /ij, CTj) for Sj ~ p{- \ 0, 1). 



Algorithm 7: Separable NES (SNES) 



input: /, /Zj„ii, (Tinit 
repeat 

for A; = 1 ... A do 

draw sample ~ AA(0, 1) 

evaluate the fitness /(z^) 
end 

sort {(sfc,Zfc)} with respect to /(z^) and compute utilities 
V<.J^ELi^fc-(Sfc-l) 



compute gradients ^ ^ ^;)^ 2 



update parameters ^ ~'~ ^ ^^jl 

a ^a- exp(7?^/2 • V„J) 

until stopping criterion is met] 



Obviously, this allows us to sample new offspring in 0{d) time (per sample). Since the 
adaptation of each component's parameters is independent, the strategy update step also 
takes only 0{d) time. 

Thus, the sacrifice of invariance, amounting to the selection of a distinguished coordinate 
system, allows for a linear time algorithm (per individual), that still maintains a reason- 
able amount of flexibility in the search distribution, and allows for a considerably faster 
adaptation of its parameters. The resulting NES variant, called separable NES (SNES), 
is illustrated in algorithm [7] for Gaussian search distributions. Note that each of the steps 
requires only 0{d) operations. Later in this section we extend this algorithm to other search 
distributions, without affecting its computational complexity. 
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5.2 Rotationally-symmetric Distributions 

The class of radial or rotationally-symmetric search distributions are distributions with 
the property p{x) = p{Ux), for all x G M'^ and all orthogonal matrices U € M'^'^'^. Let 
Qt{^) be a family of densities of a rotationally symmetric probability distributions in 
with parameter r. Thus, we can write Qt{'^) = QtI't'^) with = ||zp for some family of 
functions q-r : M-*^ — )■ M-'^. In the following we consider classes of search distributions with 
densities 

^("1'^'^'^) =dillA)"'^^" (A-i)^(z-,x)in 

with additional transformation parameters /i G M"' and invertible A G M'^^'^. The function 
Qt is the accordingly transformed density of the variable s = (A~^) (z — /x). This setting 
is rather general. It covers many important families of distributions and their multi-variate 
forms, such as multi-variate Gaussians. In addition, parameters of the radial distribution, 
most prominently its tail (controlling whether large mutations are common or rare) can be 
controlled with the parameter r. 

We apply the procedure presented in section 14.21 to this more general case. In local 
exponential coordinates 

{d,M) ^ (/^j^g^. Anew) = ^/x-h A'^(5, Aexp Qm 

we obtain the three components of log-derivatives 

V5,m,tL=o,m=o log7r(z| /I, A,r,(5,M) = {gs,9M,gT) , 



where q'^ = q^-i^ Qt denotes the derivative of qj- with respect to r^, and VtQt denotes the 
gradient w.r.t. r. 

In the special case of Gaussian search distributions, q does not depend on a parameter 
T and we have 

resulting in gs = s and g-^ = i (ss"'' — I) , recovering equations (jlip and (fT2 
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5.2.1 Sampling from Radial Distributions 

In order to use this class of distributions for search we need to be able to draw samples 
from it. The central idea is to first draw a sample s from the 'standard' density 7r(s | /x = 
0, A = I, t), which is then transformed into the sample z = A~''s + /x, corresponding to 
the density 7r(z | A, /i, r). In general, sampling s can be decomposed into sampling the 
(squared) radius component = ||z|p and a unit vector v G M"^, ||v|| = 1. The squared 
radius has the density 

where r(-) denotes the gamma function. In the following we assume that we have an efficient 
method of drawing samples from this one-dimensional density. Besides the radius we draw 
a unit vector v E M*^ uniformly at random, for example by normalizing a standard normally 
distributed vector. Then s = r • v is effectively sampled from 7r(s | /x = 0, A = I, r), and 
the composition z = rA~'^v + follows the density 7r(z | /x. A, r). In many special cases, 
however, there are more efficient ways of sampling s = r • v directly. 

5.2.2 Computing the Fisher Information Matrix 

In section 14.21 the natural gradient coincides with the plain gradient, because coordinate 
system is constructed in such a way that the Fisher information matrix for the parameters 
(5, M) is the identity. However, this is in general not possible in the presence of parameters 
r, typically controlling the radial shape, and in particular the tail, of the search distribution. 

Consider the case r G . The dimensions of d and M are d and d{d+l)/2, respectively, 
making for a total number of m = d{d + 3)/2 + d' parameters. Thus, the Fisher information 
matrix is an (m x m) matrix of the form 



^1 with . = l^i^^ G ,= 9Hog^{.) 

cj d{6,M)dT (9-r2 



Using the Woodbury identity, we compute the inverse of the Fisher matrix as 

1 _ / I v\ _ A + Hvv'^ -Hv 
~ c) ~ \ -Hv'^ H 

with H = [c — v'^v)~^, and exploiting = H. The natural gradient becomes 

p-i f{9d,9M)-Hv{v'^{gs,gM)-g-r)\ 
* V H{v'^{gs,gM)-g.) J ' 

which can be computed efficiently in only 0{d'^ + md') operations. Assuming that d' does 
not grow with d, this complexity corresponds to 0{d^) operations in terms of the search 
space dimension, compared to 0{d^) operations required for a naive inversion of the full 
Fisher matrix. In other words, the benefits of the natural coordinate system carry over, 
even if we no longer have F = I. 
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5.3 Heavy-tailed NES 

Natural gradient ascent and plain gradient ascent in natural coordinates provide two equiv- 
alent views on the working principle of NES. In this section we introduce yet another 
interpretation, with the goal of extending NES to heavy-tailed distributions, in particu- 
lar distributions with infinite variance, like the Cauchy distribution. The problem posed 
by these distributions within the NES framework is that they do not induce a Riemannian 
structure on the parameter space of the distribution via their Fisher information, which ren- 
ders the information geometric interpretation of natural coordinates and natural gradient 
ascent invalid. 

Many important types of search distributions have strong invariance properties, e.g., 
multi-variate and heavy-tailed distributions. In this still very general case, the NES principle 
can be derived solely based on invariance properties, without ever referring to information 
geometric concepts. 

The direction of the gradient VgJ{9) depends on the inner product (•, •) in use, cor- 
responding to the choice of a coordinate system or an (orthonormal) set of basis vectors. 
Thus, expressing a gradient ascent algorithm in arbitrary coordinates results in (to some 
extent) arbitrary and often sub-optimal updates. NES resolves this dilemma by relying 
on the natural gradient, which corresponds to the distinguished coordinate system (of the 
tangent space of the family of search distributions) corresponding to the Fisher information 
metric. 

The natural coordinates of a multi-variate Gaussian search distribution turn out to be 
those local coordinates w.r.t. which the current search distribution has zero mean and unit 
covariance. This coincides with the coordinate system in which the invariance properties of 
multi-variate Gaussians are most apparent. This connection turns out to be quite general. 
In the following we exploit this property systematically and apply it to distributions with 
infinite (undefined) Fisher information. 

5.3.1 Groups of Invariances 

The invariances of a search distribution can be expressed by a group Q of (affine) linear 
transformations. Typically, ^ is a sub-group of the group of orthogonal transformations 
(i.e., rotations) w.r.t. a local coordinate system. For the above example of a rotationally- 
symmetric density Q : M*^ — > M.^ (e.g., a Gaussian), the densities 

with /I G M'^ and A G M'^^'^, det(A) form the corresponding multi-variate distribution. 
Let G(n^A) be the group of invariances of 7r(z | /i. A), that is, ^(/x,a) = {d \ ^(^(z) | A) = 
7r(z|/i,A)Vz G M'^}. We have ^(0,1) = 0(.,.)(M'^) = {g \ {g{z),g{z')) = (z,z')Vz,z' G M*^}, 
where the right hand side is the group of orthogonal transformations w.r.t. an inner product, 
defined as the (affine) linear transformations that leave the inner product (and thus the 
properties induced by the orthonormal coordinate system) invariant. Here the inner product 
is the one w.r.t. which the density Q is rotation invariant. For general (/2, A) we have 
^(/x.A) = f^°G{o,l)°^~^ ^ where h{z) = Az+/2 is the affine linear transformation corresponding 
to the current search distribution. In general, the group of invariances is only a subgroup 
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of an orthogonal group, e.g., for a separable distribution Q is the finite group generated 
by coordinate permutations and axis flips. 

We argue that it is most natural to rely on a gradient or coordinate system which is 
compatible with the invariance properties of the search distribution in use. In other words, 
we should ensure the compatibility condition 



for the inner product (•, •) with respect to which we compute the gradient VJ. This con- 
dition has a straight-forward connection to the natural coordinate system introduced in 
section W72\ It is fulfilled by performing all updates in local coordinates, in which the cur- 
rent search distribution is expressed by the density 7r(- 1 0, 1) = Q(-). In these coordinates, 
the distribution is already rotationally symmetric by construction (or similar for separable 
distributions), where the rotational symmetry is defined in terms of the 'standard' inner 
product of the local coordinates. Local coordinates save us from the cumbersome explicit 
construction of an inner product that is left invariant by the group ^(^^a)- 

Note, however, that <5(z) and Q{a ■ z) have the same invariance properties. Thus, the 
invariance properties make only the gradient components V^J and VbJ unique, but not 
the scale component Vg-J. Luckily this does not affect the (1+1) hill-climber variant of 
NES, which relies on a success-based step size adaptation rule (see section [53]) . Also note 
that this derivation of the NES updates works only for families of search distributions with 
strong invariance properties, while natural gradient ascent extends to much more general 
distributions, such as mixtures of Gaussians. 

5.3.2 Cauchy Distributions 

Given these results, NES, formulated in local coordinates, can be used with heavy-tailed 
search distributions without modification. This applies in particular to the (1+1) hill- 
climber, which is the most attractive choice for heavy-tailed search distributions, because 
when the search distribution converges to a local optimum and a better optimum is located 
by a mutation, then averaging this step over the offspring population will usually result in a 
sub-optimal step that stays within the same basin of attraction. In contrast, a hill-climber 
can jump straight into the better basin of attraction, and can thus make better use the 
specific advantages of heavy-tailed search distributions. 

Of course, the computation of the plain gradient changes depending on the distribution 
in use. Once this gradient is computed in the local coordinate system respecting the invari- 
ances of the current search distribution, it can be used for updating the search parameters 
H and B without further corrections like multiplying with the (in general undefined) inverse 
Fisher matrix. For the multi-variate Cauchy distribution we have 



which results in the gradient components 





• ss 



2 
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Algorithm 8: (H-l)-NES with multi-variate Cauchy distribution 
input: /, /Xj^jj, 'Etnu = A^A 

/best < OO 

repeat 

s ~7V(0,I) 

draw sample r ~ 7rcauchy{0, 1) 

z <— rA^s + fi 

evaluate the fitness /(z) 
(S1,S2) ^ (0,s) 
(Z1,Z2) ^ (At,z) 

if /(z) > /fees* then 
update mean fj, ^ z 

/best ^ /(z) 
U^(-4,l) 

else 

I u^(f,0) 
end 

calculate log-derivatives Vm log vr (zfc|0) i ( — Sksl — I 

2 + 1 

Vm-'' ^ ^ZLi^M log7r(zfc|6') -Ufc 

A^ A-exp(i?7AVM^) 
until stopping criterion is met; 



The full NES hill-climber with multi-variate Cauchy mutations is provided in algorithm [HI 
6. Experiments 

In this section, we empirically validate the new algorithms, with the goal of answering the 
following questions: 

• How do NES algorithms perform compared to state-of-the-art evolution strategies? 

• Can we identify specific strengths and limitations of the different variants, such as 
SNES (designed for separable problems) and Cauchy-NES (with heavy-tailed distri- 
bution)? 

• Going beyond standardized benchmarks, should natural evolution strategies be ap- 
plied to real-world problems? 

We conduct a broad series of experiments on standard benchmarks, as well as more spe- 
cific experiments testing special capabilities. In total, six different algorithm variants are 
tested and their behaviors compared qualitatively as well as quantitatively, w.r.t. different 
modalities. 

We start by detailing and justifying the choices of hyperparameters, then we proceed to 
evaluate the performance of a number of different variants of NES (with and without the 
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parameter 


default value 


A 

= VB 


4+ L31og((i)J 
1 

(9 + 31og((i)) 
bdVd 




Va- 


(3 + log(d)) 




Uk 


max (O,log(| + 1) - lo^ 


;(^)) 1 


Ei=imax(0,log(| + 1)- 


log(i)) ^ 



Table 1: Default parameter values for xNES and SNES (including the utility function) as 
a function of problem dimension d. 



importance mixing and adaptation sampling techniques) on a broad collection of bench- 
marks. We further conduct experiments using the separable variant on high-dimensional 
problems, and address the question of global optimization and to what degree a heavy-tailed 
search distribution (namely multivariate Cauchy) can alleviate the problem of getting stuck 
in local optima. 



6.1 Experimental Setup and Hyperparameters 

Across all NES variants, we distinguish three hyperparameters: the population size A, the 
learning rates rj and the utility function u (because we always use fitness shaping, see 
section FS.ip . In particular, for the multivariate Gaussian case (xNES) we have the three 
learning rates r/^, Va, and vb- 

It is highly desirable to have good default settings that scale with the problem dimension 
and lead to robust performance on a broad class of benchmark functions. Table [T] provides 
such default values as function s of the problem dimension d fo r xNES. We borrowed several 
of the settings from CMA-ES ( Hansen and Qstermeier . 200 ll ). which seems natural due to 



the apparent similarity discussed in section [331 Both the population size A and the learning 
rate Vfi ^Lie the same as for CMA-ES, even if this learning rate never explicitly appears in 
CMA-ES. For the utility function we copied the weighting scheme of CMA-ES, but we 
shifted the values such that they sum to zero, which is the simplest form of implementing 
a fitness baseline; iJastrebski and Arnoldl tood ) proposed a similar approach for CMA-ES. 



The remaining parameters were determined via an empirical investigation, aiming for robust 
performance. In addition, in the separable case (SNES) the number of parameters in the 
covariance matrix is reduced from d{d + l)/2 S 0{d^) to d S 0(d), which allows us to 
increase the learning rate 7?nr by a factor of d/2> € 0{d), a choice which has proven robust 
in practice ( Ros and Hansen . 20081 ). 



The six algorithm variants that we will be evaluating below are xNES (algorithm H]) , 
its hill-climber variant (l-l-l)-xNES (see algorithm [5]) , "xNES-im-as" , that is xNES using 
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both importance mixing (section l3.2p and adaptation sampling (section 13. 3p . the separable 
SNES (as in algorithm [7|), its own hill-climber variant (1+1)-SNES (pseudocode not shown), 
and finally the heavy-tailed variant (l-l-l)-NES-Cauchy (as in algorithm [8]). A Python 
implementation of all these (and mor e) is available within the open-source machine learning 
library PyBrain dSchaul et alJ . l2ninh . 



6.2 Black-box Optimization Benchmarks 

For a practitioner it is important to understand how NES algorithms compare to other 
methods on a wide range black-box optimization scenarios. Thus, we evaluate our algorithm 
on all the benchmark functions of the 'Black-Box Optimization Benchmarking' collection 
(BBOB) from the 2010 GECCO Workshop for Real-Parameter O ptimization. The collection 
consists of 24 noise- free f unctions (12 unimodal, 12 multimodal; Hansen and Finck . 2010al ) 



and 30 noisy functions ()Hansen and Finckl . l2010bl). In order to ma ke our results fully 



comparable, we also use the identical setup (iHansen and Augerl . l20inl ). which transforms 



the pure benchmark functions to make the parameters non-separable (for some) and avoid 
trivial optima at the origin. To facilitate the comparison for the reader without overcrowding 
the plots, we provide the GECCO 2010 results of (1,4)-CMA-ES alongside our own (chosen 
as the most comparable representative among the 13 CMA-ES-based submissions to the 
workshop) . 

On all of these benchmarks, we compare xNES (as described in Algorithm and xNES- 
im-as, that is, the same algorithm but augmented with both importance mixing and adap- 
tation sampling. On the multi-modal, as well as the noisy benchmarks, we also use the 
restart strategy (see section 13. 4p . 

For the unimodal benchmarks (see Figure [7|), we plot how the performance of xNES 
scales with problem dimension (between 2 and 40). Shown are the median number of evalu- 
ations required to reach a fitness of —10"'^. We find that xNES is on par with CMA-ES for 
most benchmarks, but generally more robust (e.g., on the StepEllipsoid benchmark). Em- 
ploying importance mixing and adaptation sampling further increases performance (most 
significantly on the simple benchmarks like the Sphere function), but at the cost of robust- 
ness. 

For the multi-modal benchmarks (see Figured]), as well as the noisy benchmarks (see 
Figures [9l and [T0|) . where not all runs succeed, we instead show the cumulative success rates 
(again, for reaching a fitness of —10"'^), for problem dimension d = 5. We find that xNES 
together with our restart strategies lead to very good performance, clearly outperforming 
CMA-ES on almost all multi-modal functions, and many noisy functions. For xNES-im-as, 
the results correspond to the expected trade-off that these techniques improve performance, 
with a substantial boost on the noise-free multi-modal functions, but reduce robustness, as 
is evident from the results on many of the harder noisy benchmarks (only attenuated in 
part by the restart strategy). 

6.3 Separable NES 

The SNES algorithm is expected to perform at least as well as xNES on separable problems, 
while it should show considerably worse performance in the presence of highly dependent 
variables. We first present a number of experiments on standard benchmarks with the aim of 
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Figure 7: Unimodal benchmarks. Log-log plot of the median number of fitness evaluations 
(over 100 trials) required to reach the target fitness value of — 10~^ for the 12 uni- 
modal benchmark functions on dimensions 2 to 40. Disconnected (smaller) plot 
markers denote cases where the corresponding algorithm converged prematurely 
in at least 50% of the runs (cases for which 90% or more prematurely converged 
are not shown at all). No data was available for (1,4)-CMA-ES on dimension 
40. Note that xNES (red triangles) consistently solves all benchmarks, with a 
scaling factor that is almost the same over all functions. Also, xNES appears to 
be more stable, especially on the SharpRidge function. When employing impor- 
tance mixing and adaptation sampling (xNES-im-as, green squares), performance 
increases, most substantially on the Sphere function, while robustness decreases. 
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Number of evaluations 



Figure 8: Multi-modal benchmarks, comparing the performance of xNES with restart 
strategies (dashed red) and (1,4)-CMA-ES (dotted black). Shown is the em- 
pirical cumulative success rate (over 100 runs, 15 for CMA-ES). Note that xNES 
clearly outperforms CMA-ES on all functions except /21 and /22 (both of which 
are simply combinations of 101 and 21 Gaussian peaks, respectively, without 
any global structure), where the results are very similar. When additionally em- 
ploying importance mixing and adaptation sampling (xNES-im-as, solid green), 
performance improves even further. 
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Figure 9: Noisy benchmarks, comparing the performance of xNES with restart strategies 
(dashed red), and (1,4)-CMA-ES (dotted black). Shown is the empirical cumula- 
tive success rate (over 100 runs, 15 for CMA-ES). The benchmarks are grouped 
by type of noise (vertical) and underlying function (horizontal). Generally speak- 
ing, Cauchy-noise is the least harmful, as the rare outliers do not affect the ranks 
all that much. We find that xNES outperforms CMA-ES on the majority of the 
functions. Additionally employing importance mixing and adaptation sampling 
(xNES-im-as, solid green), however, improves performance only on some of the 
functions. Continued in Figure [TOl 
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Figure 10: Noisy benchmarks (continued). 
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Figure 11: Comparison of the performance of xNES (red circles) and SNES (blue triangles) 
on a subset of the unimodal BBOB benchmark functions. The log-log plots show 
the median number of evaluations required to reach the target fitness 10~^, for 
problem dimensions ranging from 2 to 16 (over 20 runs). The first 4 benchmark 
functions are separable, the other three are not. The inverted triangles indicate 
cases where SNES converged to the optimum in less than 90% of the runs. 



understanding its behavior and in particular its limitations in more detail. Furthermore, the 
algorithm is specifically designed to scale gracefully to high-dimensional search problems. 
We thus apply SNES to two tasks with a scalable and considerably high search space 
dimension, namely neuro-evolutionary controller design, and the problem of finding low 
energy states in Lennard-Jones potentials. 



6.3.1 Separable and Non-separable Benchmarks 



First, we ev aluate SNES on a subset o f the unimodal benchmark problems from the BBOB 
framework (IHansen and Finckl . l2010al : iHansen and Auged . l20inl ). These benchmarks test 
the capability of SNES to descend quickly into local optima, a key property of most evo- 
lution strategies. The results in figure [TT] show how SNES dominates when the function is 
separable (/i through /g), and converges much slower than xNES in non-separable bench- 
marks, as expected. In particular, on the rotated ellipsoid function /lo, which is designed 
to make separable methods fail, SNES requires 4 orders of magnitude more evaluations. In 
dimensions d > 2 it fails completely because the resolution of double precision numbers is 
insufficient for this task. 



6.3.2 Neuro-evolution 

In the second experiment, we show how SNES is well-suited for neuroevolution problems 
because they tend to be high-dimensional, multi-modal, but with highly redundant global 
optima (there is not a unique set of weights that defines the optimal behavior). In particular, 
we run it on Non-Markovian double pole balancing, a task which involves balancing two 
differently sized poles hinged on a cart that moves on a finite track. The single control 
consists of the force F applied to the cart, and observations include the cart's position and 
the poles' angles, but no velocity information, which makes this task partially observable. 
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Figure 12: Left: Plotted are the cumulative success rates on the non-Markovian double- 
pole balancing task after a certain number of evaluations, empirically determined 
over 100 runs for each algorithm, using a single tanh-unit {n = 1) (i.e., optimiz- 
ing 4 weights). We find that all three algorithm variants give state-of-the-art 
results, with a slightly faster but less robust performance for xNES with impor- 
tance mixing and adaptation sampling. Right: Median number of evaluations 
required to solve the same task, but with increasing number of neurons (and 
corresponding number of weights). We limited the runtime to one hour per 
run, which explains why no results are available for xNES on higher dimensions 
(cubic time complexity). The fact that SNES quickly outperforms xNES, also 
in number of function evaluations, indicates that the benchmark is (sufficiently 
close to) separable, and it is unnecessary to use the full covariance matrix. For 
reference we also plo t the corresponding results of the previously best performing 
algorithm CoSyNE dOomez et al.l . I2OO8I ). 



It provides a perfect testbed for algori thms focusing o n learning fine control with memory 
in continuous state and action spaces (jWielandl . Il99ll ) . The controller is represented by a 
simple recurrent neural network, with three inputs, (position x and the two poles' angles /3i 
and /32), and a variable number n of tanh units in the output layer, which are fully connected 
(recurrently), resulting in a total of n{n + 3) weights to be optimized. The activation of 
the first of these recurrent neurons directly dete r mines the force to be applied. We use the 



implementation found in PyBrain dSchaul et al.1 . 12OI0I ). 



An evaluation is considered a success if the poles do not fall over for 100, 000 time 
steps. We experimented with recurrent layers of sizes n = 1 to n = 32 (corresponding to 
between 4 and 1120 weights). It turns out that a single recurrent neuron is sufficient to solve 
the task (Figure [T2l left). In fact, both the xNES and SNES result s are state-of-the-art, 
outperforming the previously best algorithm (CoSyNE; Gomez et al. . 20081 . with a median 
of 410 evaluations) by a factor two. 

In practical scenarios however, we cannot know the best network size a priori, and thus 
the prudent choice consists in overestimating the required size. An algorithm that graciously 
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scales with problem dimension is therefore highly desirable, and we find (Figure [T2\ right) 
that SNES is exhibiting precisely that behavior. The fact that SNES outperforms xNES with 
increasing dimension, also in number of function evaluations, indicates that the benchmark 
is separable, and it is unnecessary to use the full covariance matrix. We conjecture that this 
a property shared with the majority of neuroevolution problems that have enough weights 
to exhibit redundant global optima (some of which can be found without considering all 
parameter covariances). 



6.3.3 Lennard-Jones Potentials 



In our third benchmark, we show the performance of SNES on the widely studied problem of 
minimizing t he Lennard-Jones atoni cluster potentials, which is known for being extremely 
multi- modal ( Wales and Dove . 19981 ). For that reason we employ the separable hill-climber 
variant (l-l-l)-SNES. The objective consists in finding that configuration of atoms which 
minimizes the potential energy function 



12 



where rjj is the distance between atoms i and j (see also figure [T3] for an illustration). For 
the setup here, we initialized near and the step-sizes at cTj = 0.01 to avoid jumping 
into a local optimum in the fist generation. The results are plotted in figure [T4l showing 
how SNES scales convincingly to hundreds of parameters (each run up to 500(i function 
evaluations) . 




6.4 Heavy Tails and Global Optimization 

Can NES algorithms with heavy-tailed search distributions enhance the capability of escap- 
ing local optima? Our tests with the extremely heavy-tailed Cauchy distribution investigate 
the handling of multi-modality. 
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Figure 14: Performance of (1+1)-SNES on the Lennard-Jones benchmark for atom clus- 
ters ranging from 3 to 67 atoms (corresponding to problem dimensions d of 9 
to 20 1). The yellow diamon ds indicate the best known configurations (taken 
from (| Wales and Dovel . ll998l ll. and the box-plots show upper and lower quartile 
performance (the red line being the median) of SNES, over 100 runs. The inset 
is a zoom on the behavior in small dimensions, where SNES succeeds in locating 
the true optimum in a large fraction of the runs. 



The first benchmark function is 



f2Rosen{z) = min <^ fsi-z -W),5 + fs 



10 



where /§ is the well-known Rosenbrock function ( Hansen and Finck . 2010al ). and the trans- 
formation (z— 10)/4 is component- wise. Our variant has a deceptive double-funnel structure, 
with a large valley containing a local optimum and a smaller but deeper valley containing 
the global optimum. The global structure will tend to guide the search towards the local 
optimum (see also figure [T5\ left, for an illustration). For this experiment, the search dis- 
tribution is initialized at mid-distance between the two optima, and the initial step-size a 
is varied. Figure [TCTright) shows the proportion of runs that converge to the global opti- 
mum, instead of the (easier to locate) local one, comparing for a multivariate Cauchy and 
Gaussian (l-M)-NES. 

The second experiment uses the following 'random-basin' benchmark function: 



/rb(z) = 1- 



9 
10 
1 

10 



zi 
10 



.10 



r([zi 



1=1 
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to investigate the degree to which a heavy-tail distribution can be useful when the ob- 
jective function is highly multi- modal, but there is no global structure to exploit. Here 
r : Z'^ ^ [0, 1] is a pseudo-random number generator, which approximates an i.i.d. uni- 
formly random distribution for each tuple of integers, while still being deterministic, i.e., 
each tu ple evaluates to the same value each time. In practice, we implement it as a Mersenne 



twister (jMatsumoto and Nishimural . Il998l ). seeded with the hash- value of the integers. Fur- 
ther, to avoid axis- alignment, we rotate the function by multiplying with an orthonormal 
random d x d matrix. 

One interesting property of this function is that each unit-sized hypercube is an "at- 
tractor" of a local optimum. Thus, while sampling points from one hypercube, an ES 
will contract its search distribution, making it harder to escape from that local optimum. 
Furthermore, the values of the local optima are uniformly distributed in [0, 1], and do not 
provide a systematic global trend (in contrast to the Rastrigin function) . If the optimization 
results in a value of, say, 0.11, then we know that only 11% of the local optima are better 
than this. 

Figure [16] shows the results: not surprisingly, employing the Cauchy distribution for 
search permits longer jumps, and thus enables the algorithm to find better local optima on 
average. The Cauchy version outperforms the Gaussian version by a factor of two to three, 
depending on the problem dimension. Note that the improvement (for both distributions) is 
due the number of neighbor-cubes increasing exponentially with dimension, thus increasing 
the chance that a relatively small jump will reach a better local optimum. At the same time, 
the adaptation of the step size is slowed down by the dimension-dependency of the learning 
rate, which leaves the algorithm more time to explore before it eventually converges into 
one of the local optima. 



6.5 Results Summary 

Our results have a number of implications. First of all, the results on the BBOB benchmarks 
show that NFS algorithms are competitive with the state-of-the-art across a wide variety 
of black-box optimization problems. 

Beyond this very general statement, we have demonstrated advantages and limitations of 
specific variants, and as such established the generality and flexibility of the NFS framework. 
Fxperiments with heavy-tailed and separable distributions demonstrate the viability of the 
approach on high-dimensional and complex, deceptively multi-modal domains. We obtained 
best reported results on the difficult task of training a neural controller for double pole- 
balancing. This test, together with good results on the Lennard-Jones problem, show the 
feasibility of the algorithm for real- world search and optimization problems. 

The multi-start strategy, although simple and non- adaptive in spirit, brings considerable 
improvements on many difficult multi-modal benchmarks. Its technique of interleaving 
multiple runs has advantages over truly sequential restart strategies, waiting for convergence 
before restarting. 

In summary, our results demonstrate that it is indeed possible for an algorithm with a 
clean derivation from first principles to achieve state-of-the-art to best performance in the 
heuristics-dominated field of black-box optimization. 
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Fi gure 15: Left: Contour plot of the 2-d.imensional Double-Rosenbrock function f2Rosem 
illustrating its deceptive double-funnel structure. The global structure leads the 
search to the local optimum ((14,14), red circle), whereas the true optimum ((- 
11,-11), black square) is located in the smaller valley. Right: Empirical success 
probabilities (of locating the global optimum), evaluated over 200 runs, of 1000 
function evaluations each, on the same benchmark, while varying the size of the 
initial search distribution. The results clearly show the robustness of using a 
heavy-tailed distribution. 



7. Discussion 



Technique 


Issue addressed 


Applicability 


Relevant 






limited to 


section 


Natural gradient 


Scale-invariance, many more 






Fitness shaping 


Robustness 






Importance mixing 


Performance, parameter sensitivity 






Adaptation sampling 


Performance, parameter sensitivity 




[33] 


Restart strategies 


Reduced sensitivity to local optima 






Exponential parameterization 


Covariance constraints 


Radial 




Natural coordinate system 


Efficiency 


Radial 





Table 2: Summary of enhancing techniques 



Table [2] summarizes the various techniques we introduced. The plain search gradient 
suffers from premature convergence and lack of scale invariance (see section [212]) . Therefore, 
we use the natural gradient instead, which turns NFS into a viable optimization method. To 
improve performance and robustness, we introduced several novel techniques. Fitness shap- 
ing makes the NFS algorithm invariant to order-preserving transformations of the fitness 
function, thus increasing robustness. Importance mixing reduces the number of necessary 
samples to estimate the search gradient, while adaptation sampling adjusts NES's learn- 



44 



Natural Evolution Strategies 




Figure 16: Left: Contour plot of (one instantiation of) the deceptive global optimization 
benchmark function f^b, in two dimensions. It is constructed to contain local 
optima in unit-cube-sized spaces, whose best value is uniformly random. In 
addition, it is superimposed on IC^-sized regional plateaus, also with uniformly 
random value. Right: Value of the local optimum discovered on frb (averaged 
over 250 runs, each with a budget of lOOd function evaluations) as a function of 
problem dimension. Since the locally optimal values are uniformly distributed 
in [0, 1], the results can equivalently be interpreted as the top percentile in which 
the found local optimum is located. E.g., on the 4-dimensional benchmark, NES 
with the Cauchy distribution tends to find one of the 6% best local optima, 
whereas employing the Gaussian distribution only leads to one of the best 22%. 



ing rates online. Empirically we found that using both importance mixing and adaptation 
sampling yields highly performant results on the standard benchmarks. In addition, restart 
strategies substantially improve success probabilities on multi-modal functions and noisy 
benchmarks, clearly outperforming alternative approaches. Last, the exponential parame- 
terization is crucial for maintaining positive-definite covariance matrices, and the use of the 
natural coordinate system guarantees computational feasibility. 

NES applies to general parameterizable distributions. In this paper, we have experimen- 
tally investigated three variants, adjusted to the particular properties of different problem 
classes. We demonstrated the power of the xNES variant using a full multinormal distribu- 
tion, which is invariant under arbitrary translations and rotations, on the canonical suite 
of standard benchmarks. Additionally, we showed that the restriction of the covariance 
matrix to a diagonal parameterization (SNES) allows for scaling to very high dimensions, 
both on the difficult non-Markovian double pole balancing task and the Lennard-Jones clus- 
ter potential optimization problem. Furthermore, we demonstrated that using heavy-tailed 
distributions (Cauchy) instead of Gaussian distributions yields substantial benefits in global 
optimization scenarios with multiple or deceptive local optima. 
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Unlike many black-box optimization algorithms, NES boasts a clean derivation from 
first princi ples. The relationship o f NES to methods from other fields, notably evolution 



strategies (Hansen and Ostermeiei . 2001) a nd policy gradients ( Peters and Schaal . 20081 : 



Kakade . I2OO2I : iBagnell and Schneideil . l2003l ). should be evident to readers familiar with 



both of these domains, as it marries the concept of fitness-based black-box optimization 
from evolutionary methods with the concept of Monte Carlo-based gradient estimation 
from the policy gradient framework. 



7.1 Future Work 



One intriguing possibility of extending NES starts by realizing that NES is not limited 
to continuous (flat) domains. In fact, NES can be designed to directly operate on discrete 
search spaces (e.g., graphs) or distributions with deep structure, s uch as deep belief networks 
and d eep Boltzmann machines (jHinton and Salakhutdinovl . l2006l : ISalakhutdinov and Hintonl . 
2OO9I ). This would allow us to find more complex solutions by combining and recombin- 
ing building blocks produced by a deep network, opening up an interesting connection to 
genetic algorithms and their cross-over operators. 

Program space constitutes one discrete search space of particular interest, and it seems 

proin ising to extend the NES approach to program search (genetic programming. iKozal. 

1992 ) , building upon probabilis tic representations of program trees (e.g.. lSalustowicz and Schmidhuber 
19971 : iBosman and Jongl . l2004l ) for which the natural gradient can be estimated. 

Another possible direction of future research could comprise the investigation of sparsely 
parametrized distributions by means of which we can exploit selected covariance structure 
while keeping linear complexity, even in high dimensions. T his structure c an be either given 
in the problem domain (e.g., grouping weights by neuron dComez et al.l . I2OO8I )) or learned 
incrementally. 

Lastly, the approach is not necessarily restricted to purely following the gradient on 
expected fitness; for example, the objective could be formulated diffe rently, as in including 
a trade-off between fitness and information gain (|Schmidhubeil . l2009l ). 



7.2 Conclusion 

In this paper, we introduced Natural Evolution Strategies, a novel family of algorithms 
that constitutes a principled alternative to standard stochastic search methods such as 
evolutionary algorithms. Maintaining a parameterized distribution on the set of solution 
candidates, the natural gradient is used to update the distribution's parameters in the 
direction of higher expected fitness. 

A collection of techniques have been introduced, which addresses issues of convergence, 
robustness, computational complexity, sensitivity to hyperparameters, and sampling meth- 
ods for algorithm speed. We investigated a number of instantiations of the NES family, 
ranging from general-purpose multi-variate normal distributions to heavy-tailed and sepa- 
rable distributions specialized in global optimization and high dimensionality, respectively. 
The results show best published performance on various standard benchmarks, as well as 
competitive performance on others. 
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In conclusion, NES algorithms are high-performing, derived from first principles and 
easy to implement. Their clean conceptual framework allows for a broad range of intriguing 
future developments. 
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Appendix A: Weighted Mann- Whitney Test 

This appendix defines a Weighted Mann- Whitney test, as used in section [3.31 for virtual 
adaptation sampling. Although the derivations are trivial, the authors are not aware of any 
previous attempt to create a test with the same purpose. 

Background. The classical Mann-Whitney test determines (with confidence p) whether 
two sets of samples S = {si} and S' = {s[} are likely to come from the same distribution. 
For that, the so-called U-statistic is computed: 

Si>s'. *i = «j 

Let p = and a = \J ^^ ^^12 ' where n and n' are the number of samples in S 
and 5", respectively. We can then determine the significance of the difference between S 
and S' . They are different with confidence p if: 

• ^( ^~^ ) > 1 — p (if S has larger values), or 

• ^{ ^^^ ) < p (if 5' has larger values). 

Introducing Weights. Now, assume that every sample in S and S' has a (positive) 
weight (wi or w'^) associated to it. We can generalize the Mann- Whitney test by interpreting 
the weights as fractional number of occurrences in the sets: 

Accordingly, we also need to adjust the number of samples: m = Y17=i ^* ~ 

Eti < and thus p = ^anda = ,J^^^±R, 

We can see this weighted U-statistic as an interpolation between the cases covered 
classical one. In fact, if the weights are integers, a sample s with weight w can be replaced 
equivalently by w occurrences of the same sample s (each with weight 1). 
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